{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Split by tokens \n",
    "按令牌拆分\n",
    "\n",
    "Language models have a token limit. You should not exceed the token limit. When you split your text into chunks it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text you should use the same tokenizer as used in the language model.\n",
    "\n",
    "语言模型具有令牌限制。不应超过令牌限制。因此，当您将文本分割成块时，计算标记的数量是一个好主意。有许多标记器。在计算文本中的标记时，应使用与语言模型中使用的相同的标记器。"
   ]
  },
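  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of token counting (a minimal sketch, assuming the `tiktoken` package is installed), you can count the tokens in a string with the `cl100k_base` encoding used in the next section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "\n",
    "# Hypothetical example: count tokens with the cl100k_base encoding,\n",
    "# the same encoding used by the splitters below.\n",
    "encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
    "print(len(encoding.encode(\"Language models have a token limit.\")))"
   ]
  },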
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## tiktoken"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "# This is a long document we can split up.\n",
    "with open(\"../../state_of_the_union.txt\") as f:\n",
    "    state_of_the_union = f.read()\n",
    "    \n",
    "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
    "    encoding=\"cl100k_base\", chunk_size=100, chunk_overlap=0\n",
    ")\n",
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
    "    model_name=\"gpt-4\",\n",
    "    chunk_size=100,\n",
    "    chunk_overlap=0,\n",
    ")"
   ]
  },
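  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To inspect chunk sizes in tokens (a minimal sketch, assuming `tiktoken` is installed and `texts` was produced by the cell above), re-count each chunk with the encoding tiktoken associates with `gpt-4`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "\n",
    "# Hypothetical check: measure each chunk with the encoding tiktoken maps to \"gpt-4\"\n",
    "encoding = tiktoken.encoding_for_model(\"gpt-4\")\n",
    "print(max(len(encoding.encode(chunk)) for chunk in texts))"
   ]
  },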
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import TokenTextSplitter\n",
    "\n",
    "text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)\n",
    "\n",
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## spaCy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import SpacyTextSplitter\n",
    "\n",
    "# This is a long document we can split up.\n",
    "with open(\"../../state_of_the_union.txt\") as f:\n",
    "    state_of_the_union = f.read()\n",
    "\n",
    "text_splitter = SpacyTextSplitter(chunk_size=1000)\n",
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SentenceTransformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import SentenceTransformersTokenTextSplitter\n",
    "\n",
    "splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)\n",
    "text = \"Lorem \"\n",
    "count_start_and_stop_tokens = 2\n",
    "text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens\n",
    "print(text_token_count)\n",
    "\n",
    "token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1\n",
    "\n",
    "# `text_to_split` does not fit in a single chunk\n",
    "text_to_split = text * token_multiplier\n",
    "\n",
    "print(f\"tokens in text to split: {splitter.count_tokens(text=text_to_split)}\")\n",
    "\n",
    "text_chunks = splitter.split_text(text=text_to_split)\n",
    "\n",
    "print(text_chunks[1])"
   ]
  },
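  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check (reusing the `count_tokens` helper shown above), you can print the token count of each chunk and compare it against `maximum_tokens_per_chunk`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the token count of every chunk produced above\n",
    "print([splitter.count_tokens(text=chunk) for chunk in text_chunks])"
   ]
  },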
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NLTK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import NLTKTextSplitter\n",
    "\n",
    "# This is a long document we can split up.\n",
    "with open(\"../../state_of_the_union.txt\") as f:\n",
    "    state_of_the_union = f.read()\n",
    "    \n",
    "text_splitter = NLTKTextSplitter(chunk_size=1000)\n",
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## KoNLPY"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import KonlpyTextSplitter\n",
    "\n",
    "# This is a long Korean document that we want to split up into its component sentences.\n",
    "with open(\"./your_korean_doc.txt\") as f:\n",
    "    korean_document = f.read()\n",
    "    \n",
    "text_splitter = KonlpyTextSplitter()\n",
    "texts = text_splitter.split_text(korean_document)\n",
    "# The sentences are split with \"\\n\\n\" characters.\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hugging Face tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import GPT2TokenizerFast\n",
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
    "\n",
    "# This is a long document we can split up.\n",
    "with open(\"../../../state_of_the_union.txt\") as f:\n",
    "    state_of_the_union = f.read()\n",
    "\n",
    "text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(\n",
    "    tokenizer, chunk_size=100, chunk_overlap=0\n",
    ")\n",
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain0_1",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
